Multiple Linear Regression in R

BSTA 6100 Fall 2025

Nicholas J. Seewald, PhD

2025-11-25

Example Setup

A study was carried out to examine the relationship between body weight (measured in pounds) and covariates age (years) and height (inches) among children aged 6-12.

d <- read.csv("MLR_example_child.csv")
str(d)
'data.frame':   12 obs. of  4 variables:
 $ id : int  1 2 3 4 5 6 7 8 9 10 ...
 $ wt : int  64 71 53 67 55 58 77 57 56 51 ...
 $ ht : int  57 59 49 62 51 50 55 48 42 42 ...
 $ age: int  8 10 6 11 8 7 10 9 10 6 ...

Simple Linear Regression

First, we’ll use simple linear regression to estimate an association between height and weight.

mod1 <- lm(wt ~ ht, data = d)
summary(mod1)

Call:
lm(formula = wt ~ ht, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8736 -3.8973 -0.4402  2.2624 11.8375 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   6.1898    12.8487   0.482  0.64035   
ht            1.0722     0.2417   4.436  0.00126 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.471 on 10 degrees of freedom
Multiple R-squared:  0.663, Adjusted R-squared:  0.6293 
F-statistic: 19.67 on 1 and 10 DF,  p-value: 0.001263

Simple Linear Regression

Next, we’ll use simple linear regression to estimate an association between age and weight.

mod2 <- lm(wt ~ age, data = d)
summary(mod2)

Call:
lm(formula = wt ~ age, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.000  -3.911   1.143   4.071  10.000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  30.5714     8.6137   3.549  0.00528 **
age           3.6429     0.9551   3.814  0.00341 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.015 on 10 degrees of freedom
Multiple R-squared:  0.5926,    Adjusted R-squared:  0.5519 
F-statistic: 14.55 on 1 and 10 DF,  p-value: 0.003407

Inference on SLR Parameters

Estimate a 90% confidence interval for the association between age and weight.

confint(mod2, parm = "age", level = 0.9)
         5 %     95 %
age 1.911748 5.373966
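The same interval can be computed by hand as estimate ± t* × SE, where t* is the 0.95 quantile of a t distribution on n − 2 = 10 degrees of freedom (coefficient values below are copied from the summary(mod2) output):

```r
# 90% CI by hand: estimate +/- t* x SE (values from summary(mod2))
est <- 3.6429                  # age coefficient
se  <- 0.9551                  # its standard error
tstar <- qt(0.95, df = 10)     # a 90% CI uses the 0.95 quantile
est + c(-1, 1) * tstar * se    # matches confint() up to rounding
```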

Inference on SLR Fitted Values

Estimate a 95% confidence interval for the weight of a child who is 53 inches tall.

predict(mod1, 
        newdata = data.frame("ht" = 53),
        interval = "confidence")
       fit      lwr      upr
1 63.01806 59.49644 66.53968

Estimate a 95% prediction interval for the weight of a child who is 53 inches tall.

predict(mod1, 
        newdata = data.frame("ht" = 53),
        interval = "prediction")
       fit      lwr      upr
1 63.01806 50.32924 75.70687
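The prediction interval is wider because it adds the residual variance to the uncertainty in the estimated mean: the CI half-width is t* × se.fit, while the PI half-width is t* × sqrt(σ̂² + se.fit²). A rough check using values from mod1's output (the se.fit value here is backed out from the confidence interval above, so it is approximate):

```r
# CI vs. PI half-widths at ht = 53 (sigma from summary(mod1);
# se.fit backed out from the confidence interval above)
fit   <- 63.01806
sefit <- 1.5805                # approximate se of the fitted mean
sigma <- 5.471                 # residual standard error, df = 10
tstar <- qt(0.975, df = 10)
fit + c(-1, 1) * tstar * sefit                      # confidence interval
fit + c(-1, 1) * tstar * sqrt(sigma^2 + sefit^2)    # prediction interval
```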

Multiple Linear Regression

To fit a multiple regression model in R, add covariates to the formula object:

mod3 <- lm(wt ~ ht + age, data = d)
summary(mod3)

Call:
lm(formula = wt ~ ht + age, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8708 -1.7004  0.3454  1.4642 10.2336 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   6.5530    10.9448   0.599   0.5641  
ht            0.7220     0.2608   2.768   0.0218 *
age           2.0501     0.9372   2.187   0.0565 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared:   0.78, Adjusted R-squared:  0.7311 
F-statistic: 15.95 on 2 and 9 DF,  p-value: 0.001099

Multiple Linear Regression: Interpretation

summary(mod3)

Call:
lm(formula = wt ~ ht + age, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8708 -1.7004  0.3454  1.4642 10.2336 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   6.5530    10.9448   0.599   0.5641  
ht            0.7220     0.2608   2.768   0.0218 *
age           2.0501     0.9372   2.187   0.0565 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared:   0.78, Adjusted R-squared:  0.7311 
F-statistic: 15.95 on 2 and 9 DF,  p-value: 0.001099

Interpret each of the parameters.

Multiple Linear Regression: Interpretation

summary(mod3)

Call:
lm(formula = wt ~ ht + age, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8708 -1.7004  0.3454  1.4642 10.2336 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   6.5530    10.9448   0.599   0.5641  
ht            0.7220     0.2608   2.768   0.0218 *
age           2.0501     0.9372   2.187   0.0565 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared:   0.78, Adjusted R-squared:  0.7311 
F-statistic: 15.95 on 2 and 9 DF,  p-value: 0.001099

Interpret the F test.

Multiple Linear Regression: Prediction

Given a fitted lm object, we can get estimated means / predicted outcomes using predict():

predict(mod3, 
        newdata = data.frame(
          "ht" = mean(d$ht),
          "age" = mean(d$age)
        ),
        se.fit = TRUE,
        interval = "confidence", # or "prediction"
        level = 0.95)
$fit
    fit      lwr      upr
1 62.75 59.70699 65.79301

$se.fit
[1] 1.345181

$df
[1] 9

$residual.scale
[1] 4.659845

The newdata argument is an optional data frame containing values of the covariates at which you want to predict the outcome. So the above code returns the estimated mean weight of a child with average height and age from the data: \hat{Y} = 6.553 + 0.7220(52.75) + 2.0501(8.833) = 62.75 \text{ lbs}.
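That arithmetic is just the inner product of the coefficient vector with the covariate vector; a quick check using the values above:

```r
# Fitted mean at average height and age, by hand (values from summary(mod3))
b <- c(6.5530, 0.7220, 2.0501)   # intercept, ht, age coefficients
x <- c(1, 52.75, 8.833)          # 1, mean height, mean age
sum(b * x)                       # ~ 62.75
```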

Multiple Linear Regression: Interval Estimation

Given a fitted lm object, we can get estimated confidence intervals for the parameters using the confint() function.

We can get CIs for all parameters:

confint(mod3, level = 0.95)
                   2.5 %    97.5 %
(Intercept) -18.20587071 31.311967
ht            0.13205592  1.312020
age          -0.07002526  4.170278

Or just a subset of them, using the parm argument:

confint(mod3, parm = "ht", level = 0.95)
       2.5 %  97.5 %
ht 0.1320559 1.31202

Palmer Penguins Data

penguins <- read.csv("penguins.csv", stringsAsFactors = TRUE)
head(penguins)
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen           36.7          19.3               193        3450
5  Adelie Torgersen           39.3          20.6               190        3650
6  Adelie Torgersen           38.9          17.8               181        3625
     sex year
1   male 2007
2 female 2007
3 female 2007
4 female 2007
5   male 2007
6 female 2007

MLR Model

reg1 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm, 
           data = penguins)
summary(reg1)

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm, 
    data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1083.08  -282.65   -17.18   242.95  1293.24 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5836.299    312.604 -18.670   <2e-16 ***
bill_length_mm        4.959      5.214   0.951    0.342    
flipper_length_mm    48.890      2.034  24.034   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.4 on 330 degrees of freedom
Multiple R-squared:  0.7627,    Adjusted R-squared:  0.7613 
F-statistic: 530.4 on 2 and 330 DF,  p-value: < 2.2e-16

Updating an MLR model

We can add or subtract covariates to an existing lm() object using update():

reg2 <- update(reg1, ~ . + species)
summary(reg2)

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm + 
    species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-811.52 -227.81  -33.74  216.73 1053.26 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3864.073    533.592  -7.242 3.18e-12 ***
bill_length_mm       60.117      7.207   8.341 2.06e-15 ***
flipper_length_mm    27.544      3.209   8.583 3.77e-16 ***
speciesChinstrap   -732.417     82.063  -8.925  < 2e-16 ***
speciesGentoo       113.254     89.206   1.270    0.205    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 339.6 on 328 degrees of freedom
Multiple R-squared:  0.8243,    Adjusted R-squared:  0.8222 
F-statistic: 384.7 on 4 and 328 DF,  p-value: < 2.2e-16

MLR with categorical covariates

summary(reg2)

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm + 
    species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-811.52 -227.81  -33.74  216.73 1053.26 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3864.073    533.592  -7.242 3.18e-12 ***
bill_length_mm       60.117      7.207   8.341 2.06e-15 ***
flipper_length_mm    27.544      3.209   8.583 3.77e-16 ***
speciesChinstrap   -732.417     82.063  -8.925  < 2e-16 ***
speciesGentoo       113.254     89.206   1.270    0.205    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 339.6 on 328 degrees of freedom
Multiple R-squared:  0.8243,    Adjusted R-squared:  0.8222 
F-statistic: 384.7 on 4 and 328 DF,  p-value: < 2.2e-16

Interpret the species parameters.

Hypothesis Testing in Multiple Linear Regression

Omnibus Tests

Above, we fit a model reg2: Y_i = \beta_0 + \beta_1 \text{(bill length)}_i + \beta_2 \text{(flipper length)}_i + \beta_3 \text{chinstrap}_i + \beta_4 \text{gentoo}_i + \epsilon_i.

Consider testing \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0, the omnibus test of all parameters in the model. We can do this just from the R output for reg2:

summary(reg2)

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm + 
    species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-811.52 -227.81  -33.74  216.73 1053.26 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3864.073    533.592  -7.242 3.18e-12 ***
bill_length_mm       60.117      7.207   8.341 2.06e-15 ***
flipper_length_mm    27.544      3.209   8.583 3.77e-16 ***
speciesChinstrap   -732.417     82.063  -8.925  < 2e-16 ***
speciesGentoo       113.254     89.206   1.270    0.205    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 339.6 on 328 degrees of freedom
Multiple R-squared:  0.8243,    Adjusted R-squared:  0.8222 
F-statistic: 384.7 on 4 and 328 DF,  p-value: < 2.2e-16
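The omnibus F statistic in the last line can be recovered from R² alone, since F = (R²/k) / ((1 − R²)/(n − k − 1)) with k = 4 predictors and n = 333 observations (values copied from the summary(reg2) output):

```r
# Omnibus F recovered from R^2 (values from summary(reg2))
r2 <- 0.8243; k <- 4; n <- 333
Fstat <- (r2 / k) / ((1 - r2) / (n - k - 1))
Fstat                                          # ~ 384.7
pf(Fstat, k, n - k - 1, lower.tail = FALSE)    # the omnibus p-value
```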

Covariate Subset Tests

Now consider testing the null that \beta_3 = \beta_4 = 0, i.e., that mean penguin body mass does not vary by species, adjusted for bill length & flipper length. This is a covariate subset test. We can do this by fitting a reduced model and using the anova() function.

reg2_reduced <- update(reg2, ~ . - species)
summary(reg2_reduced)

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm, 
    data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1083.08  -282.65   -17.18   242.95  1293.24 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5836.299    312.604 -18.670   <2e-16 ***
bill_length_mm        4.959      5.214   0.951    0.342    
flipper_length_mm    48.890      2.034  24.034   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.4 on 330 degrees of freedom
Multiple R-squared:  0.7627,    Adjusted R-squared:  0.7613 
F-statistic: 530.4 on 2 and 330 DF,  p-value: < 2.2e-16
anova(reg2_reduced, reg2)
Analysis of Variance Table

Model 1: body_mass_g ~ bill_length_mm + flipper_length_mm
Model 2: body_mass_g ~ bill_length_mm + flipper_length_mm + species
  Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
1    330 51071963                                  
2    328 37820194  2  13251769 57.464 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
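The partial F statistic in the table can be reproduced by hand from the two models' residual sums of squares (values copied from the ANOVA table above):

```r
# Partial F by hand from the ANOVA table's RSS values
rss_red  <- 51071963; df_red  <- 330   # reduced model
rss_full <- 37820194; df_full <- 328   # full model
Fstat <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
Fstat                                          # ~ 57.46
pf(Fstat, df_red - df_full, df_full, lower.tail = FALSE)
```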

General Linear Hypothesis Test

We can use the glht() function from the multcomp package to test general linear hypotheses. (If needed, install it with install.packages("multcomp") and load it with library(multcomp).)

Let’s look at a model for penguin body mass against bill length, flipper length, and island.

reg4 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm + island, data = penguins)
summary(reg4)

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm + 
    island, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1017.04  -264.91   -18.85   217.82  1370.07 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -4355.027    403.515 -10.793  < 2e-16 ***
bill_length_mm       16.883      5.601   3.014  0.00278 ** 
flipper_length_mm    39.656      2.535  15.645  < 2e-16 ***
islandDream        -333.716     58.736  -5.682 2.95e-08 ***
islandTorgersen    -190.960     70.748  -2.699  0.00731 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 376.5 on 328 degrees of freedom
Multiple R-squared:  0.784, Adjusted R-squared:  0.7814 
F-statistic: 297.7 on 4 and 328 DF,  p-value: < 2.2e-16

Let’s consider testing whether the difference in mean body mass on Dream and Torgersen islands is zero, adjusting for bill length & flipper length.

First, set up the hypothesis test:

H_0: \quad \text{vs.} \quad H_1:

GLHT Method 1: Symbolic Description

To compare factor levels, use the mcp() function.

library(multcomp)
islandTest <- glht(reg4, linfct = mcp(island = "Dream - Torgersen = 0"))
summary(islandTest)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts


Fit: lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm + 
    island, data = penguins)

Linear Hypotheses:
                       Estimate Std. Error t value Pr(>|t|)  
Dream - Torgersen == 0  -142.76      69.72  -2.047   0.0414 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

GLHT Method 2: Contrast Matrix

We could also specify a contrast vector or matrix:

coef(reg4)
      (Intercept)    bill_length_mm flipper_length_mm       islandDream 
      -4355.02693          16.88279          39.65617        -333.71627 
  islandTorgersen 
       -190.95953 
K <- rbind("Dream - Torgersen" = c(0, 0, 0, 1, -1))
K
                  [,1] [,2] [,3] [,4] [,5]
Dream - Torgersen    0    0    0    1   -1
islandTest2 <- glht(reg4, linfct = K)
summary(islandTest2)

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm + 
    island, data = penguins)

Linear Hypotheses:
                       Estimate Std. Error t value Pr(>|t|)  
Dream - Torgersen == 0  -142.76      69.72  -2.047   0.0414 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
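The contrast estimate reported by glht() is just Kβ̂; a by-hand check using the coefficient values printed by coef(reg4) above:

```r
# K %*% beta-hat by hand (coefficient values copied from coef(reg4))
b <- c(-4355.02693, 16.88279, 39.65617, -333.71627, -190.95953)
K <- c(0, 0, 0, 1, -1)
sum(K * b)    # ~ -142.76, the Dream - Torgersen difference
```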

Multiple Linear Regression Assumptions

Linearity

With multiple covariates, the linearity assumption becomes even more critical. Not only do we have to correctly model the form of the relationship between Y and X_1, but also X_2, X_3, etc. We can use a plot of residuals vs. fitted values to check this assumption. We could also check with partial residual (component + residual) plots:

library(car)
crPlots(reg2)

Outliers: Leverage

Leverage is a measure of how much the data from the ith observation contribute to the fitted value \hat{Y}_i. High leverage means that X_i strongly determines \hat{Y}_i. Remember \hat{Y} = HY = X \left({X}^\top {X}\right)^{-1} {X}^\top {Y}.

The leverage for the ith observation is the (i,i)th element (the ith diagonal) of the hat matrix H, h_{ii} = x_{i}^\top (X^\top X)^{-1} x_{i}, and reflects outlyingness in X space.

Outliers: Studentized Residuals

The standardized residuals are scaled by an estimate of the errors’ standard deviation. We could instead scale each residual by an estimate of its own standard deviation, which depends on leverage.

Remember,

\operatorname{Var}(\hat{\epsilon}) = \sigma^2(I_n - H),

where H = X(X^\top X)^{-1} X^\top is the hat matrix and I_n is the n\times n identity matrix. The diagonal elements of the hat matrix, denoted h_{ii}, are called leverages. The variance of a single residual is \operatorname{Var}(\hat{\epsilon}_i) = \sigma^2 (1 - h_{ii}). The studentized residuals are

\hat{t}_{i} = \frac{\hat{\epsilon}_i}{\hat{\sigma} \sqrt{1-h_{ii}}}.

The studentized residuals have constant variance 1 regardless of the location of x_i when the model is correctly specified. Studentized residuals are useful for examining outliers because they’re leverage-adjusted: they account for the fact that the variance of the residuals changes with X_i.

Studentized residuals approximately follow a standard normal distribution: “large” values (relative to a standard normal) indicate outlyingness.

Outliers: Jackknife Residuals

The jackknife residuals are similar to studentized residuals, but refit to remove the ith individual:

\hat{r}_{(-i)} = \frac{\hat{\epsilon}_i}{\hat{\sigma}_{(-i)} \sqrt{1 - h_{ii}}},

where \hat{\sigma}_{(-i)}^2 is an estimator of \sigma^2 with the ith individual removed (i.e., that doesn’t use data from individual i).

Jackknife residuals remove the ith individual’s influence on the variance estimate used to scale its own residual, since \hat{\epsilon}_i can otherwise mask its own outlyingness:

  • For points with high leverage, |\hat{\epsilon}_i| may be small (remember the variance of the ith residual is \sigma^2 (1 - h_{ii}))

  • If |\hat{\epsilon}_i| is large, \hat{\sigma}^2 might over-estimate \sigma^2, so things like the studentized residuals might be too small.
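R computes all three quantities directly: hatvalues() for leverage, rstandard() for (internally) studentized residuals, and rstudent() for jackknife residuals. A minimal sketch on simulated data (the toy data frame is invented for illustration):

```r
# Leverage, studentized, and jackknife residuals on simulated data
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 0.5 * toy$x + rnorm(20)
fit <- lm(y ~ x, data = toy)
head(cbind(leverage    = hatvalues(fit),
           studentized = rstandard(fit),
           jackknife   = rstudent(fit)))
```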

Influence

Influence measures the “pull” a single point has on a fitted regression line. Highly influential points draw the fitted line closer to them, possibly biasing results. They typically have both

  • high leverage

  • departure from the trend set by the other data points

but neither alone necessarily implies influence: an influential point combines high leverage with outlyingness.

One measure of influence is Cook’s Distance:

D_i = \frac{\left(\hat{\beta}_{(-i)}-\hat{\beta}\right)^\top \left(X^\top X\right)\left(\hat{\beta}_{(-i)} - \hat{\beta}\right)}{(p+1)\hat{\sigma}^2}.

Cook’s distance is a scaled distance between the fitted regression coefficients after removing observation i and the fitted regression coefficients that use observation i. If D_i > 1, generally we’ll say the observation is influential.
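The raw D_i values are available via cooks.distance(); a sketch on simulated data with one deliberately influential point (the toy data are invented for illustration):

```r
# Cook's distance flags a point with high leverage that departs from trend
set.seed(2)
x <- c(1:19, 40)                            # last point: high leverage
y <- c(2 + 0.5 * (1:19) + rnorm(19), 30)    # ...and off the trend line
fit <- lm(y ~ x)
which(cooks.distance(fit) > 1)              # the influential point stands out
```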

We can plot Cook’s distance for all points in the model using the which argument to plot.lm():

plot(reg2, which = 4)